Supervised By: Dr. Arthur Petrosian
Written By: Emma Movsesyan
When I was choosing the topic of my capstone project, I was sure about two things:
I am glad that, with the help of the Data Science course and the project, I was able to reach the desired result. As I consider the connection shown below truly exciting, let me first of all visualize the connections among the different courses that helped me accomplish my capstone project.
So, from the graph above we may infer that my capstone project consists of two main parts.
Both of them are accomplished using SoloLearn’s datasets.
SoloLearn is an Armenian startup that aims to teach coding to everyone, from anywhere and from any background. It is a mobile code-learning platform that can be used by anyone who has the desire to learn coding: https://www.sololearn.com/.
As SoloLearn's data is really big, the decision was made to subset its 10,000,000-user data set down to 100,000 users and do the visualizations on that smaller data.
For visualization purposes, the regions from which SoloLearn's top 20 users come were identified. Then data on each of those regions were collected: the total number of users in that particular region, and the number of users by their level (in-app level) in that region. This was visualized using different interactive R packages.
Geo=gvisGeoChart(continents, locationvar="country_code", colorvar="users_total",
options=list(colors="['#aeff04', '#9eea00', '#6a9e00']",
title="You can see from which continents are the top users!",
titleTextStyle="{color:'green',fontName:'Courier',fontSize:16}",
bar="{groupWidth:'100%'}"))
plot(Geo)
Pie <- gvisPieChart(sub_cont,
options =list(
is3D=TRUE,
pieStartAngle=300,
title="Number of users from top 5 continents!",
titleTextStyle="{color:'green',fontName:'Courier',fontSize:16}",
bar="{groupWidth:'100%'}"))
plot(Pie)
In this graph the statistics of 4 regions are presented: Armenia, Turkey, Kenya, and Antarctica. Yes, Antarctica: we have 3 users from Antarctica, which is something we were not expecting to find :)
Preparing the data for visualizing with a bubble chart.
continents<-continents[which(
continents$country_code=="am" |
continents$country_code=="tr" |
continents$country_code=="aq" |
continents$country_code=="ke"),]
melted_continents <- melt(continents, id.vars = "country_code",
                          measure.vars = c("l1", "l2", "l3", "l4",
                                           "l5", "l6", "l7", "l8",
                                           "l9", "l10", "l11", "l12",
                                           "l13", "l14", "l15", "l16"))
## Warning in melt_dataframe(data, as.integer(id.ind - 1),
## as.integer(measure.ind - : '.Random.seed' is not an integer
## vector but of type 'NULL', so it was ignored
melted_continents <- melted_continents[-which(melted_continents$value == 0), ]
names(melted_continents)[2] <- "levels"
melted_continents$levels <- as.numeric(melted_continents$levels)
Bubble <- gvisBubbleChart(melted_continents, idvar="country_code",
xvar="levels", yvar="value",sizevar="levels",
colorvar="country_code",
options =list(
colors="['#aeff04', '#9eea00', '#6a9e00']",
title="BubbleChart for Antarctican, Kenyan, Armenian, and Turkish users.",
titleTextStyle="{color:'green',fontName:'Courier',fontSize:16}",
bar="{groupWidth:'100%'}")
)
plot(Bubble)
Armenian users by their levels are presented below.
column_chart <- gvisColumnChart(continents, xvar="country_code",
yvar=c("l1", "l2","l3", "l4",
"l5", "l6","l7", "l8",
"l9", "l10","l11", "l12",
"l13", "l14","l15", "l16"),
options=list(
title="Armenian Users Advancement in SoloLearn",
titleTextStyle="{color:'green',fontName:'Courier',fontSize:16}",
bar="{groupWidth:'100%'}")
)
plot(column_chart)
You can get familiar with the app by running the UI file from the package sent.
The nature of the problem.
SoloLearn is a mobile code-learning platform where people from different spheres, backgrounds, and cultures learn to code. According to SoloLearn users, one of the greatest features the app provides is its discussion forum. This is the place where users ask questions and share and exchange their knowledge. But as the user base is big and multicultural (as we could infer from the visualization part), sometimes the discussions on the forum do not fit into the scope of the coding content, so our moderators spend a long time filtering those discussions. Since the app's data is growing exponentially, after learning about text classification I thought it would be great to apply it and automate this process of filtering bad (spammy) comments in the discussions.
As the moderators were spending time classifying the comments by hand, we already have a classified data set, which can be used for training our classifier.
There are different classifiers that could be chosen for this particular problem, but in the process of research the conclusion was made that, among the less computationally intensive classifiers, Naive Bayes does a really good job, so it is the one applied to the SoloLearn comment-classification problem.
Note: In the reference part you can get familiar with the sources that influenced the choice of classifier.
In this part we will try to understand how Bayesian classification is performed on a small data set taken from the SoloLearn discussions.
So, let's suppose we have the six comments given below, of which four are spam and two are non-spam (ham) comments. Our goal is to predict whether a new (unclassified) comment is going to be spam or not. Comments:
The new comment we need to classify: please, send me your js code of web paint project…
Now that we have the problem well formulated, let's go step by step through the process of constructing the classifier.
First:
We need to calculate the prior probability of spam and ham.
P(spam) = number of spam comments / total number of comments = 4/6
P(ham) = number of ham comments / total number of comments = 2/6
Second:
We need to take all the individual words that have ever been seen in the comments. Then we need to build a basis vocabulary from those words and count how many times each particular word was encountered in spam vs. ham comments. The vocabulary constructed from the data above is given below.
| spam | ham | word |
|---|---|---|
| 2/4 | 1/2 | send |
| 2/4 | 0/2 | number |
| 1/4 | 2/2 | code |
| 1/4 | 1/2 | please |
| 1/4 | 1/2 | review |
Note: we are not going to put into our vocabulary words that appear fewer than 2 times in our data, or words like [me, your, is, an, etc.], as they do not convey important information and so would not have a positive effect on our classifier.
In Natural Language Processing this process is called cleaning the data. More details of this process will be covered in the cleaning part of the real SoloLearn data.
Third:
Calculating the likelihood probabilities. We are going to predict the probability of our new comment, "please, send me your js code of web paint project…", being spam vs. ham. We need to eliminate the "new" words from the comment and keep only the ones that we have information about in our vocabulary. So let's see the conversion below.
please, send me your js code of web paint project… -> please send code
Now, we must convert the comment to an attribute-value representation according to the basis vocabulary words that we have: for each word in the vocabulary we put 1 if it appears in the comment and 0 if it does not.
Comment's attribute-value representation: please send code -> 10110
Calculating the Likelihood:
P(please send code | spam) = P(10110 | spam) = (2/4)(1 - 2/4)(1/4)(1/4)(1 - 1/4) ≈ 0.012
P(please send code | ham) = P(10110 | ham) = (1/2)(1 - 0/2)(2/2)(1/2)(1 - 1/2) = 0.125
Finally! We are ready to apply Bayes Rule for Classification:
https://en.wikipedia.org/wiki/Naive_Bayes_classifier
Bayes Rule: posterior = (prior × likelihood) / evidence
Calculating Posterior probabilities:
P(spam | 10110) = (0.67)(0.012) / [(0.67)(0.012) + (0.33)(0.125)] ≈ 0.16
P(ham | 10110) = (0.33)(0.125) / [(0.67)(0.012) + (0.33)(0.125)] ≈ 0.84
So, from the posterior probabilities calculated above we can conclude that the probability of the new comment "please, send me your js code of web paint project…" being spam is 0.16 and being ham is 0.84. Therefore, the new comment is more likely to be ham.
This is a good result, as we want this kind of comment to stay in the SoloLearn app's discussion forum.
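The hand calculation above can be checked with a few lines of R. This is a minimal sketch using only the toy counts from the vocabulary table (no real SoloLearn data, and no library needed):

```r
# Priors from the toy data: 4 spam and 2 ham comments
p_spam <- 4/6
p_ham  <- 2/6

# Per-word likelihoods from the vocabulary table (send, number, code, please, review)
spam_probs <- c(send = 2/4, number = 2/4, code = 1/4, please = 1/4, review = 1/4)
ham_probs  <- c(send = 1/2, number = 0/2, code = 2/2, please = 1/2, review = 1/2)

# Attribute-value representation of "please send code" -> 10110
x <- c(send = 1, number = 0, code = 1, please = 1, review = 0)

# Likelihood: product over words, using p if the word is present and (1 - p) if absent
likelihood <- function(p, x) prod(ifelse(x == 1, p, 1 - p))
lik_spam <- likelihood(spam_probs, x)  # ~0.012
lik_ham  <- likelihood(ham_probs, x)   # 0.125

# Bayes rule: posterior = prior * likelihood / evidence
evidence  <- p_spam * lik_spam + p_ham * lik_ham
post_spam <- p_spam * lik_spam / evidence  # ~0.16
post_ham  <- p_ham  * lik_ham  / evidence  # ~0.84
```

Running this reproduces the numbers derived by hand, which is a useful sanity check before moving to the full pipeline.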
Now that we have understood the intuition behind the Naive Bayes classifier, we can use library(e1071) for constructing the classifier in R.
But before that, there are a couple of things we need to do. Below are the steps for the solution of our classification problem:
commentsData_corpus_clean <- tm_map(commentsDataCorpus,
content_transformer(tolower))
commentsData_corpus_clean <- tm_map(commentsData_corpus_clean,
removeNumbers)
commentsData_corpus_clean <- tm_map(commentsData_corpus_clean,
removeWords,stopwords())
commentsData_corpus_clean <- tm_map(commentsData_corpus_clean,
removePunctuation)
Now that the data are processed to our liking, the final step is to split the comments into individual components through a process called tokenization. A token is a single element of a text string; in this case, token == word. Finally, we create a Document-Term Matrix (DTM) in which rows indicate documents (comments) and columns indicate terms (words).
Note: The order of cleaning steps matters!
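To make the DTM idea concrete, here is a minimal base-R sketch on two hypothetical toy comments (the real pipeline uses tm's DocumentTermMatrix, which also handles the cleaning steps above):

```r
# Two toy comments (hypothetical examples, not from the real data)
comments_toy <- c("please send code", "send your number")

# Tokenize by splitting on whitespace
tokens <- strsplit(comments_toy, " ")

# Basis vocabulary: all distinct words across the comments
vocab <- sort(unique(unlist(tokens)))

# Rows = documents, columns = terms, cells = word counts
dtm_toy <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
dtm_toy
```

Here dtm_toy has one row per comment and one column per vocabulary word; for example, the "send" column holds 1 in both rows, since both toy comments contain that word.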
comments_dtm <- DocumentTermMatrix(commentsData_corpus_clean)
comments_dtm_train <- comments_dtm[1:4169,]
comments_dtm_test <- comments_dtm[4170:5559,]
comments_train_labels <- comments[1:4169,]$type
comments_test_labels <- comments[4170:5559,]$type
prop.table(table(comments_train_labels))
## comments_train_labels
## ham spam
## 0.8776685 0.1223315
prop.table(table(comments_test_labels))
## comments_test_labels
## ham spam
## 0.8856115 0.1143885
library("RColorBrewer")
library("wordcloud")
## Warning: package 'wordcloud' was built under R version 3.3.3
wordcloud(commentsData_corpus_clean,min.freq = 50,
random.order = FALSE,colors=brewer.pal(8, "Dark2"))
spam <- subset(comments,type=="spam")
ham <- subset(comments,type=="ham")
wordcloud(spam$text,max.words = 40,
scale = c(3,0.5),colors=brewer.pal(8, "Dark2"))
wordcloud(ham$text,max.words = 40,
scale = c(3,0.5),colors=brewer.pal(8, "Dark2"))
The Naive Bayes implementation we will employ is in the e1071 package.
We will build our model on the comments_train matrix.
The comments_classifier object will then contain a naiveBayes classifier object that can be used to make predictions.
# Restrict the DTMs to frequent terms (a minimum frequency of 5 is assumed here)
comments_freq_words <- findFreqTerms(comments_dtm_train, 5)
comments_dtm_freq_train <- comments_dtm_train[, comments_freq_words]
comments_dtm_freq_test <- comments_dtm_test[, comments_freq_words]
# naiveBayes expects categorical features, so convert counts to "Yes"/"No"
convert_counts <- function(x) ifelse(x > 0, "Yes", "No")
comments_train <- apply(comments_dtm_freq_train, MARGIN = 2, convert_counts)
comments_test <- apply(comments_dtm_freq_test, MARGIN = 2, convert_counts)
# training the model
library(e1071)
comments_classifier <- naiveBayes(comments_train, as.factor(comments_train_labels))
To evaluate the comments classifier, we need to test its predictions on unseen comments in the test data. Recall that the unseen comment features are stored in a matrix named comments_test, while the class labels (spam or ham) are stored in a vector named comments_test_labels. The classifier that we trained has been named comments_classifier. We will use this classifier to generate predictions and then compare the predicted values to the true values.
comments_test_pred <- predict(comments_classifier,comments_test)
library("gmodels")
## Warning: package 'gmodels' was built under R version 3.3.3
CrossTable(comments_test_pred,comments_test_labels,
prop.chisq = FALSE,
prop.t = FALSE,
dnn = c("predicted","actual"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1390
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 856 | 28 | 884 |
## | 0.968 | 0.032 | 0.636 |
## | 0.695 | 0.176 | |
## -------------|-----------|-----------|-----------|
## spam | 375 | 131 | 506 |
## | 0.741 | 0.259 | 0.364 |
## | 0.305 | 0.824 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1231 | 159 | 1390 |
## | 0.886 | 0.114 | |
## -------------|-----------|-----------|-----------|
##
##
From the table above we can see that there are 375 ham comments misclassified as spam and 28 spam comments misclassified as ham. So, overall, out of 1390 comments we have 403 misclassified, and our accuracy rate is approximately 71%.
Note: these results may vary slightly, as a new random data set is generated on every code execution.
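Using the counts from the run shown in the cross table above, the accuracy can be recomputed in a couple of lines of R:

```r
# Counts taken from the CrossTable output above
correct_ham  <- 856   # ham comments predicted as ham
correct_spam <- 131   # spam comments predicted as spam
total        <- 1390  # total observations in the table

# Accuracy = correctly classified comments / all comments
accuracy <- (correct_ham + correct_spam) / total
round(accuracy, 2)  # ~0.71
```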
The accuracy reached via Naive Bayes is quite good, but I hope that by applying more advanced cleaning and language-processing techniques the accuracy will increase.
The exploration and application of those techniques to this problem is one area I am going to work on in the future. I am also planning to apply more computationally intensive classifiers, such as neural networks, to this problem.
AUA Data Science Course
AUA Machine Learning
Flowing Data
DiagrammeR
RGraphGallery
RVisualization
RShiny
Data Science Specialization (9 courses)
Stanford University ML
An Introduction to Statistical Learning
Statistical Learning Channel